Abstract In supervised learning, simple baseline classifiers can be constructed by
only looking at the class, i.e., ignoring any other information from the dataset. The 
single-label learning community frequently uses as a reference the one which always 
predicts the majority class. Although a classifier might perform worse than this 
simple baseline classifier, this behaviour requires a special explanation. To
motivate the community to compare experimental results with the ones provided
by a multi-label baseline classifier, and to call attention to the need for special
explanations when classifiers perform worse than the baseline, in this
work we propose the use of General_B, a multi-label baseline classifier. General_B
was evaluated against results published in the literature, which were carefully
selected using a systematic review process. It was found that a considerable number
of published results on 10 frequently used datasets are worse than or equal to
the ones obtained by General_B, and for one dataset this proportion reaches 43% of the
published results. Moreover, although a simple baseline classifier was not
considered in these publications, it was observed that even for very poor results
no special explanations were provided in most of them. We hope that the findings
of this work will encourage the multi-label community to consider
using a simple baseline classifier, such that further explanations are provided when
a classifier performs worse than a baseline.

Keywords machine learning • multi-label classification • multi-label baseline 
classifier • systematic review 


1 Introduction 


Given a set of examples (instances) characterized by the value of attributes and 
the class of the example, the aim of supervised learning algorithms is to construct 
a classifier which is able to assign new examples to the class they belong to. To 
this end, a great many learning algorithms have been proposed. The constructed
classifiers are usually compared over a variety of datasets using various evaluation 
measures proposed in the literature. Most results are averages over a number of 
runs, where each run involves splitting the dataset into disjoint training and test 
sets, and the test set is used to estimate several evaluation measures of the classifier 
generated using the corresponding training set. Afterwards, it is important to 
statistically verify the hypothesis of improved performance (or not) of the learning 
algorithm (Demsar 2006). However, we consider that the evaluation measures of
the classifier constructed by a learning algorithm should also be compared with
the ones obtained by a simple baseline classifier, as is actually done by most
of the single-label learning community. This way, if any of these measures is
worse, the community would be encouraged to provide an additional explanation of this
fact.

In single-label learning, each example in the dataset is associated with only
one class, which can assume several values. The task is called binary classification
if there are only two possible class values (Yes/No), and multi-class classification
when the number of class values is greater than two (Alpaydin 2010).


For single-label learning, a simple baseline classifier is the one constructed by 
only taking into account the class values, i.e., it does not consider the attributes 
that describe the examples in the dataset. Having only this information, and due 
to the fact that the classification of a new instance has only two possible outcomes, 
correct or incorrect, the best it can do is to output a classifier that always predicts 
the most frequently occurring class value in the dataset. Before the single-label 
learning community started to pay attention to this very simple baseline classifier, 
many evaluation measures worse than or equal to this baseline classifier had been 
published in the scientific literature, without special explanations. 
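
The single-label baseline described above can be sketched in a few lines. This is only an illustrative sketch: the function names and the toy class values are assumptions made here, not part of the original study.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Build a classifier that ignores the attributes and always predicts
    the most frequent class value seen in the training labels."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda example=None: majority

def error_rate(predict, examples, labels):
    """Fraction of examples whose predicted class differs from the true class."""
    wrong = sum(1 for x, y in zip(examples, labels) if predict(x) != y)
    return wrong / len(labels)

# Toy data (hypothetical): attributes are irrelevant to this baseline.
train_y = ["no", "no", "yes", "no", "yes", "no"]
baseline = majority_class_baseline(train_y)

test_x = [None] * 4
test_y = ["no", "yes", "no", "no"]
baseline_error = error_rate(baseline, test_x, test_y)
```

Any published error rate on the same test data that is greater than or equal to `baseline_error` would be a candidate for the kind of special explanation discussed here.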


In (Holte 1993), an experimental comparison involving 16 commonly used single-label
datasets is carried out, where the error rates of the proposed algorithms are
compared to the error rates of several learning systems reported in the literature.
However, although the datasets used are not highly skewed, some of these reported
results fail to improve the error rate of the simple baseline classifier. For example,
considering two of these datasets, Breast Cancer and Hepatitis, from the collection
distributed by the University of California at Irvine (Bache and Lichman 2013),
75 and 33 error rates are compiled, respectively. For the Hepatitis dataset, 8 out
of 33 (more than 24%) reported error rates are worse than or equal to the simple
baseline classifier, while for the Breast Cancer dataset the same happens for 29
out of 75 (more than 38%) reported error rates. As the simple single-label baseline
classifier is constructed by only looking at the class values, any learning algorithm
which learns from non-skewed domains, and also takes into account the dataset
attribute values, should be able to construct a classifier with smaller error rates.

Unlike single-label learning, in multi-label learning an example can belong
to several classes simultaneously. The main difference between multi-label learning
and single-label learning is that classes in multi-label learning are often correlated,
whereas the class values in single-label learning are mutually exclusive. Due to the
increasing number of applications where examples are annotated with more than
one class, multi-label learning has received increasing attention from the machine
learning community (Tsoumakas et al. 2010).

However, finding a simple multi-label baseline classifier by only looking at the 
multi-labels is not as straightforward as in single-label, where the classification of 
a new instance has only two possible outcomes, correct or incorrect, and the error 
rate is often considered an important single objective to be achieved. This is not 
the case in multi-label, as the evaluation measures of a multi-label classifier should 
also take into account partially correct classifications. To this end, many criteria 
are proposed to evaluate the classification performance from different perspectives. 
In (Dembczynski et al. 2012), the connections among these criteria are established,
showing that some of these criteria are uncorrelated or even negatively correlated.
In other words, some loss functions are essentially conflicting. Thus, several multi-label
evaluation measures have been proposed, highlighting different aspects of
this important characteristic of multi-label learning.


Motivated by the lack of simple multi-label baseline classifiers, in (Metz et al.
2012) we propose a simple way to construct multi-label baseline classifiers for specific
multi-label evaluation measures. Nevertheless, as a multi-label classifier which
focuses on minimizing/maximizing one of these measures does not necessarily
minimize/maximize the others, we also proposed a unique simple baseline classifier,
called General_B, which does not focus on any one of these specific measures and
can be used to determine all the multi-label evaluation measure baseline values of
a classifier.

Although we do not claim that the proposed General_B multi-label baseline
classifier should be the one used by the community whenever classifier evaluation
measures are published, as other baseline classifiers could be proposed in
the future, we believe that it is time to start a discussion on this subject.
To motivate the community, in this work we consider published experimental
results which show that, similar to the early days of single-label research,
some of the published results fail to improve on the ones obtained by our simple
multi-label baseline classifier.

However, unlike (Holte 1993), in which results reported on a dataset could also
refer to the classifier generated by a learning algorithm using a slightly different
dataset due to pre-processing, such as filter feature selection or other transformation,
in this work we only used the results published in papers reporting experimental
results of classifiers which have been constructed using identical, publicly available
datasets. Unfortunately, this constraint leaves out a great many papers,
such as many related to text categorization, a typical multi-label problem,
as most of the publicly available text datasets are modified by the authors in different
ways to obtain the final dataset from which the classifier is generated and,
in most cases, this final dataset is not publicly available. On the other hand, this
constraint enables anyone to reproduce the experiments described in these papers.

As there is a lack of reviews focusing on pieces of work which report experimental
results for multi-label learning, and the systematic review process can be useful
to identify related publications in a wide, rigorous and replicable way (Kitchenham
et al. 2010), we used this process to identify publications which report experimental
results for multi-label learning. We have gathered the data used in this work from
the selected publications which answer the systematic review research question
and do not fulfill any of the exclusion criteria.

More specifically, in this work we report on several statistics of various evaluation
measure values, which were published and obtained using the 10 datasets
most frequently used in the selected papers. These statistics show that 12.8% of
these published results are worse than or equal to the ones obtained by our simple
multi-label baseline classifier General_B. Moreover, this percentage is unevenly
distributed among the datasets. In the “worst” dataset, 43.0% of the published results
are worse than or equal to the baseline, and in the “best” one only 0.6%. However,
although a simple baseline classifier was not considered in these publications, it was
observed that even for very poor results no special explanations were provided in most
of these publications.

The remainder of this paper is organized as follows: Section 2 briefly describes
multi-label learning and the evaluation measures used in this work. Section 3
explains the simple baseline classifier General_B. The systematic review carried out
to select the papers from which we have gathered the data used in this work is
described in Section 4, and statistics of these published evaluation measure values
are reported in Section 5. Section 6 presents the conclusions and future work.


2 Multi-label Classification and Evaluation Measures 

Let $D$ be a training set composed of $N$ examples $E_i = (\mathbf{x}_i, Y_i)$, $i = 1..N$. Each
example $E_i$ is associated with a feature vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$ described
by $M$ features $X_j$, $j = 1..M$, and a subset of labels $Y_i \subseteq L$, where $L = \{y_1, y_2, \ldots, y_q\}$
is the set of $q$ labels. Table 1 shows this representation. In this scenario, the multi-label
classification task consists of generating a classifier $H$ which, given an unseen
instance $E = (\mathbf{x}, ?)$, is capable of accurately predicting its subset of labels $Y$, i.e.,
$H(E) \rightarrow Y$.


Table 1 Multi-label data 



The predominant approaches of multi-label learning methods are: algorithm 
adaptation and problem transformation (Tsoumakas et al. 2010). The first one 
consists of methods which extend specific learning algorithms in order to handle 
multi-label data directly. The second approach is algorithm independent, allowing 
the use of any state-of-the-art single-label learning algorithm to carry out multi-label
learning. It consists of methods which transform the multi-label classification
problem into either several binary classification problems, such as the Binary Relevance
(BR) approach, or one multi-class classification problem, such as the Label
Powerset (LP) approach.

The BR approach decomposes the multi-label learning task into $q$ independent
binary classification problems, one for each label in $L$. To this end, the multi-label
dataset $D$ is first decomposed into $q$ binary datasets $D_{y_j}$, $j = 1..q$, which are used
to construct $q$ independent binary classifiers. In each binary classification problem,
examples associated with the corresponding label are regarded as positive and the
other examples as negative. Finally, to classify a new multi-label instance, BR outputs
the aggregation of the labels positively predicted by the $q$ independent binary
classifiers. As BR scales linearly with the size $q$ of the label set $L$, it is appropriate
when $q$ is not very large. Although in its simple form it suffers from the deficiency
that correlation among the labels is not taken into account, successful attempts
have been made to model correlation using binary classifiers (Read et al.
2009; Tsoumakas et al. 2009; Cherman et al. 2012). On the other hand, the LP
approach transforms the multi-label learning task into a multi-class learning task,
considering every unique combination of labels in a multi-label dataset as one class
value of the corresponding multi-class dataset. Unlike BR, LP takes into account
correlation among the labels.
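
The two problem transformations can be illustrated with a minimal sketch. The function names, the representation of a dataset as a list of (feature vector, label set) pairs, and the toy data are assumptions made for illustration; they are not taken from Mulan or from the original experiments.

```python
def binary_relevance(dataset, labels):
    """BR: build one binary dataset per label. An example is positive for a
    label if that label belongs to its multi-label, negative otherwise."""
    return {
        label: [(x, label in y) for x, y in dataset]
        for label in labels
    }

def label_powerset(dataset):
    """LP: build one multi-class dataset in which every distinct multi-label
    becomes a single class value (represented here as a frozenset)."""
    return [(x, frozenset(y)) for x, y in dataset]

# Toy multi-label dataset (hypothetical): (feature vector, set of labels).
toy = [
    ((0.1, 1.0), {"a", "b"}),
    ((0.7, 0.2), {"b"}),
    ((0.4, 0.9), {"a", "b"}),
    ((0.3, 0.3), {"c"}),
]

br = binary_relevance(toy, ["a", "b", "c"])   # q = 3 binary datasets
lp = label_powerset(toy)                      # one multi-class dataset
lp_classes = {cls for _, cls in lp}           # distinct multi-labels
```

In this sketch BR yields three binary problems regardless of how the labels co-occur, while LP yields one problem whose number of classes equals the number of distinct multi-labels.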

Evaluating the performance of multi-label classifiers is difficult mostly because 
multi-label prediction has an additional notion of being partially correct. To this 
end, several measures have been proposed for the evaluation of bipartitions and 
rankings with respect to the ground truth of multi-label data. A complete discussion
of these performance measures is out of the scope of this paper, and can be
found in (Tsoumakas et al. 2010).

Measures that evaluate bipartitions are further divided into example-based and 
label-based. The former are calculated based on the average differences between the true
and the predicted multi-labels of all examples in the test set, while the latter decompose
the evaluation process into separate evaluations of each of the q labels, which are 
afterwards averaged on all labels. In what follows, we briefly describe the measures 
used in this work to evaluate bipartitions. 


2.1 Example-based 

Let $Y_i$ be the set of true labels (true multi-label) and $Z_i$ be the set of predicted
labels (predicted multi-label) of example $E_i$. Hamming-Loss is defined by Equation 1,
where $\Delta$ represents the symmetric difference between two sets.

$$\text{Hamming-Loss}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \,\Delta\, Z_i|}{|L|} \qquad (1)$$

Hamming-Loss evaluates how frequently labels in the multi-label are misclassified,
i.e., the example is associated with a wrong label, or a label belonging to the
true multi-label is not predicted.

Subset-Accuracy is defined by Equation 2, where $I(\text{true}) = 1$ and $I(\text{false}) = 0$.

$$\text{Subset-Accuracy}(H, D) = \frac{1}{N} \sum_{i=1}^{N} I(Z_i = Y_i) \qquad (2)$$

Subset-Accuracy is a very strict evaluation measure as it requires an exact match
of the predicted and the true set of labels.

In (Godbole and Sarawagi 2004), the following definitions for Accuracy, Precision
and Recall, defined by Equations 3, 4 and 5 respectively, are proposed.

$$\text{Accuracy}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \qquad (3)$$

$$\text{Precision}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Z_i|} \qquad (4)$$

$$\text{Recall}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i|} \qquad (5)$$

Accuracy is the proportion of correctly predicted labels to the total number
of labels in the union of the predicted and the true label sets of an instance. Precision
is the proportion of correctly predicted labels to the total number of predicted labels,
and Recall is the proportion of correctly predicted labels to the total number of
true labels.

F-Measure, frequently used as a performance measure for information retrieval
systems, is the harmonic mean of Precision and Recall, defined by Equation 6.

$$\text{F-Measure}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times |Y_i \cap Z_i|}{|Z_i| + |Y_i|} \qquad (6)$$

All these performance measures have values in the interval [0, 1]. For Hamming-Loss,
the smaller the value, the better the multi-label classifier performance is,
while for the other measures, greater values indicate better performance.
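
The six example-based measures of Equations 1 to 6 can be computed directly from sets of labels. The following is a minimal sketch rather than the Mulan implementation; the function names and toy data are assumptions, and so is the convention of scoring a ratio as 0 when its denominator is empty, which the equations leave unspecified.

```python
def _safe_ratio(numerator, denominator):
    # Assumption: a ratio with an empty denominator is scored as 0.
    return numerator / denominator if denominator else 0.0

def example_based(true_sets, pred_sets, num_labels):
    """Example-based measures (Equations 1 to 6) averaged over N examples."""
    n = len(true_sets)
    pairs = list(zip(true_sets, pred_sets))
    return {
        # y ^ z is the symmetric difference, y & z the intersection, y | z the union.
        "hamming_loss": sum(len(y ^ z) for y, z in pairs) / (n * num_labels),
        "subset_accuracy": sum(1 for y, z in pairs if y == z) / n,
        "accuracy": sum(_safe_ratio(len(y & z), len(y | z)) for y, z in pairs) / n,
        "precision": sum(_safe_ratio(len(y & z), len(z)) for y, z in pairs) / n,
        "recall": sum(_safe_ratio(len(y & z), len(y)) for y, z in pairs) / n,
        "f_measure": sum(_safe_ratio(2 * len(y & z), len(y) + len(z)) for y, z in pairs) / n,
    }

# Toy data (hypothetical): true multi-labels Y and predicted multi-labels Z, |L| = 3.
Y = [{"a", "b"}, {"b"}, {"a", "c"}]
Z = [{"a", "b"}, {"a", "b"}, {"c"}]
scores = example_based(Y, Z, num_labels=3)
```

On this toy data only the first prediction is an exact match, so Subset-Accuracy is 1/3, whereas Accuracy (2/3) and F-Measure (7/9) give credit to the two partially correct predictions.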


2.2 Label-based 


In this case, for each single label $y_i \in L$, the $q$ binary classifiers are initially evaluated
using any one of the binary evaluation measures proposed in the literature,
such as Accuracy, F-Measure, ROC and others, which are afterwards averaged
over all labels. Two operations, macro-averaging and micro-averaging, can be used
to average over all labels.

Let $B(T_{P_{y_i}}, F_{P_{y_i}}, T_{N_{y_i}}, F_{N_{y_i}})$ be a binary evaluation measure calculated for
a label $y_i$ based on the number of true positives ($T_P$), false positives ($F_P$), true
negatives ($T_N$) and false negatives ($F_N$). The macro-average version of $B$ is defined
by Equation 7 and the micro-average by Equation 8.

$$B_{macro} = \frac{1}{q} \sum_{i=1}^{q} B(T_{P_{y_i}}, F_{P_{y_i}}, T_{N_{y_i}}, F_{N_{y_i}}) \qquad (7)$$

$$B_{micro} = B\left(\sum_{i=1}^{q} T_{P_{y_i}}, \sum_{i=1}^{q} F_{P_{y_i}}, \sum_{i=1}^{q} T_{N_{y_i}}, \sum_{i=1}^{q} F_{N_{y_i}}\right) \qquad (8)$$


Thus, the binary evaluation measure used is computed on individual labels 
first and then averaged for all labels by the macro-averaging operation, while 
it is computed globally for all instances and all labels by the micro-averaging 
operation. This means that macro-averaging would be more affected by labels 
that participate in fewer multi-labels, i.e., fewer examples, which is appropriate 
in the study of unbalanced datasets (Dendamrongvit et al. 2011). Furthermore, it 
should be observed that for some binary evaluation measures, such as Accuracy, 
macro-average and micro-average yield the same result. 
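
The difference between the two averaging operations, and the fact that they coincide for Accuracy, can be checked with a small sketch. The helper names and the toy data are assumptions made for illustration, as is the convention of scoring F-Measure as 0 when a label has no true or predicted positives.

```python
def confusion_counts(true_sets, pred_sets, label):
    """Return (TP, FP, TN, FN) for one label over all examples."""
    tp = fp = tn = fn = 0
    for y, z in zip(true_sets, pred_sets):
        if label in z and label in y:
            tp += 1
        elif label in z:
            fp += 1
        elif label in y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f_measure(tp, fp, tn, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0  # assumed convention

def binary_accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def macro_average(measure, counts):
    """Evaluate the measure per label, then average over the q labels."""
    return sum(measure(*c) for c in counts) / len(counts)

def micro_average(measure, counts):
    """Sum the counts over all labels, then evaluate the measure once."""
    totals = [sum(column) for column in zip(*counts)]
    return measure(*totals)

# Toy data (hypothetical): label "c" is rare and never predicted.
labels = ["a", "b", "c"]
Y = [{"a", "b"}, {"b"}, {"a", "c"}, {"b"}]
Z = [{"a", "b"}, {"a", "b"}, {"b"}, {"b"}]
counts = [confusion_counts(Y, Z, label) for label in labels]

macro_f = macro_average(f_measure, counts)
micro_f = micro_average(f_measure, counts)
macro_acc = macro_average(binary_accuracy, counts)
micro_acc = micro_average(binary_accuracy, counts)
```

Here the rare label pulls the macro-averaged F-Measure (19/42) well below the micro-averaged one (2/3), while the two averages of Accuracy are identical.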


3 The Simple Multi-label Baseline Classifier General_B


In supervised learning, simple baseline classifiers can be constructed by only looking
at the class, i.e., ignoring any other information from the dataset. The single-label
learning community frequently uses as a reference the one which always predicts
the majority class (the most frequent class value). Although a classifier might
perform worse than this simple baseline classifier, as could be the case when learning
from highly skewed domains, this behaviour requires a special explanation. In
multi-label learning, as the LP transformation maps each distinct multi-label into
a single-label, transforming the multi-label dataset into a multi-class (single-label)
dataset, one could ask: why not use the classifier which always predicts the majority
multi-label as the simplest multi-label baseline classifier? Although it is a possible
baseline, which focuses on maximizing Subset-Accuracy, this strategy does not take
into account the individual label distribution in the multi-labels, which provides
additional information.

Moreover, due to the fact that multi-label prediction has the notion of being 
partially correct, several multi-label evaluation measures have been proposed to 
evaluate the classification performance from different perspectives. 

In (Metz et al. 2012), we propose specific simple baseline classifiers which
are tailored to maximize/minimize one specific multi-label measure at a time.
However, a specific baseline classifier tailored to maximize/minimize one measure
does not necessarily maximize/minimize the other measures. Moreover, having
different baseline classifiers to consider would be cumbersome, due to the
number of different multi-label evaluation measures proposed in the literature, as
well as multi-label learning algorithms which are tailored to maximize/minimize
more than one measure (multi-objective). In that work we also proposed General_B,
a simple baseline classifier which does not focus on maximizing/minimizing any
specific measure, and which can be used to find general baselines for any bipartite
multi-label evaluation measure.

The rationale behind General_B to find the predicted multi-label $Z$ is very
simple. It consists of ranking the single-labels in $L$ according to their individual
relative frequencies in the multi-labels, and then including the $\sigma$ most frequent
single-labels in $Z$. We are then left with the problem of choosing $\sigma$ such that $Z$ is
representative, i.e., with a reasonable number of single-labels, avoiding being
too strict (including too few single-labels) or too flexible (including
too many single-labels). As we are interested in finding $Z$ that represents the single-label
distribution in the multi-labels well, we defined $\sigma$ as the closest integer value
to the label cardinality. The label cardinality, defined by Equation 9, represents
the average size of the multi-labels in a dataset $D$ composed of $N$ examples, i.e.,
the average number of single-labels associated with each instance.


$$CR(D) = \frac{1}{N} \sum_{i=1}^{N} |Y_i| \qquad (9)$$


In case of ties (single-labels with the same frequency), we consider the label co-occurrence
measure, choosing the one which maximizes its co-occurrence with the
better ranked labels. It should be observed that, like every other learner, General_B
has a particular bias. For instance, it will work well for Subset-Accuracy whenever
there is a positive correlation among the most frequent labels. However, it will
not work well when the correlation is negative. In other words, it will work better
whenever its bias fits the dataset well.
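
The construction described in this section can be sketched as follows. This is not the Mulan-based implementation used in the study: the function names are hypothetical, rounding halves up is an assumption about "closest integer", and the tie-break score (the sum of pairwise co-occurrence counts with the labels already selected, then label name for determinism) is one possible reading of the co-occurrence rule.

```python
from collections import Counter

def label_cardinality(multi_labels):
    """Average number of single-labels per example (Equation 9)."""
    return sum(len(y) for y in multi_labels) / len(multi_labels)

def cooccurrence(label, selected, multi_labels):
    """Assumed tie-break score: number of (example, selected label) pairs in
    which the candidate label appears together with a better ranked label."""
    return sum(1 for y in multi_labels for s in selected if label in y and s in y)

def general_b(multi_labels):
    """Return the multi-label Z that the baseline predicts for every instance."""
    freq = Counter(label for y in multi_labels for label in y)
    # Closest integer to the label cardinality (halves rounded up: assumption).
    sigma = int(label_cardinality(multi_labels) + 0.5)
    selected, remaining = [], set(freq)
    while remaining and len(selected) < sigma:
        best = min(
            remaining,
            key=lambda l: (-freq[l], -cooccurrence(l, selected, multi_labels), l),
        )
        selected.append(best)
        remaining.remove(best)
    return frozenset(selected)

# Toy training multi-labels (hypothetical): "b" and "c" tie on frequency.
train = [{"a", "c"}, {"a", "c"}, {"a", "b"}, {"b"}]
Z = general_b(train)
```

In this toy example the label cardinality is 1.75, so two labels are selected: "a" is the most frequent, and the tie between "b" and "c" is resolved in favour of "c", which co-occurs with "a" more often.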

The specific baseline classifiers, as well as General_B, were implemented using
the Mulan framework (Tsoumakas et al. 2011), a Java package for multi-label
classification based on Weka, which is commonly used by the multi-label learning
community.

An analysis of several multi-label bipartition evaluation measure baselines obtained
by the specific baseline classifiers, as well as by the General_B baseline classifier,
showed that, as expected, the specific ones perform better on the measure they try to
maximize/minimize, although they degrade on the other measures. On the other hand,
General_B shows a reasonable performance for all the considered bipartite measures.
Ranking the results obtained by the specific baseline classifiers and by General_B,
it was observed that General_B is ranked “in the middle”, as shown in (Metz et al.
2012), making it suitable to be used as a general baseline multi-label classifier.


4 Systematic Review 


Although multi-label classification has drawn increasing attention from the machine
learning and data mining communities in the past decade, there are few
extensive reviews researching publications on this topic. Moreover, to the best of 
our knowledge, there is no extensive review which explicitly focused on papers 
reporting experimental results for multi-label learning. 

To this end, as we need published experimental results on evaluation measures
of multi-label classifiers to compare with our proposed multi-label baseline classifier
General_B, we have carried out the Systematic Review (SR) process, a method
to search for relevant papers in a wide, rigorous and replicable way (Kitchenham
et al. 2010). The SR process is able to answer Research Questions (RQ) about a
subject by using a protocol of planned activities to identify, select and summarize
relevant pieces of work.

The aim of our systematic review, which is reported in (Spolaor et al. 2013),
is to answer the following RQ: what are the publications which report experimental
results for multi-label learning research? To this end, we used nine worldwide
online bibliographic database search engines, listed in Table 2, in which 1,543
publications were identified.

Weka: http://www.cs.waikato.ac.nz/ml/weka


Table 2 Bibliographic databases used in the systematic review

Database          URL
ACM Portal        http://portal.acm.org
CiteSeer          http://citeseerx.ist.psu.edu
Interscience      http://onlinelibrary.wiley.com
ScienceDirect     http://www.sciencedirect.com
Scirus            http://scirus.com
Scopus            http://www.scopus.com
SpringerLink      http://link.springer.com
Xplore            http://ieeexplore.ieee.org
Web of Science    http://isiknowledge.com


Some retrieved pieces of work can be duplicated, as some sources, i.e., journals, 
proceedings and others, are indexed by more than one bibliographic database. 
Thus, cases with duplicated titles were automatically or manually (mistyped titles) 
removed, keeping only one copy of the publication. From the 1,543 publications, 
847 (55%) were automatically removed and 79 (5%) were manually removed. Thus, 
we were left with 617 (40%) publications which were divided among the authors 
of this paper, such that each one of the 617 publications was manually analyzed 
using 16 exclusion criteria as a guide. Whenever a publication fulfilled one or 
more exclusion criteria, it was removed. If there were doubts about removing a 
publication, a second reviewer verified the doubtful publication. 

The 16 exclusion criteria include: publications which do not consider example-based
or label-based evaluation measures; restricted access to the dataset; pre-processed
datasets where the final attribute-value table used by the learning algorithm
is not publicly available, and others. Recall that we only collected evaluation
measures of classifiers that were obtained by multi-label learning algorithms using
identical attribute-value datasets. At this stage, we were left with 64 (4%) publications
which do not fulfill any of the 16 exclusion criteria. Figure 1 shows a
summary of the selection procedures.
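
The counts reported for the selection procedure can be cross-checked with simple arithmetic. In the sketch below the variable names are assumptions, and the number of papers removed by the exclusion criteria is derived here from the reported totals rather than stated in the text.

```python
identified = 1543        # publications returned by the nine search engines
auto_removed = 847       # duplicated titles removed automatically
manual_removed = 79      # duplicated (mistyped) titles removed manually
selected = 64            # publications fulfilling no exclusion criterion

after_dedup = identified - auto_removed - manual_removed
excluded_by_criteria = after_dedup - selected  # derived, not stated in the text

def share(count):
    """Percentage of the 1,543 identified publications, rounded to an integer."""
    return round(100 * count / identified)

shares = {
    "auto_removed": share(auto_removed),
    "manual_removed": share(manual_removed),
    "after_dedup": share(after_dedup),
    "excluded_by_criteria": share(excluded_by_criteria),
    "selected": share(selected),
}
```

The rounded shares reproduce the percentages given in the text (55%, 5%, 40% and 4%), with the remaining share corresponding to papers that fulfilled at least one exclusion criterion.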

Nevertheless, similar to known systematic reviews (Kitchenham et al. 2010), 
results are bound to the electronic databases searched for publications, nine in our 
case. Thus, papers potentially relevant to the research question might not have 
been identified. 

Fig. 1 Summary of the selection procedures to find the relevant papers used in this work from
a total of 1,543 publications (categories: duplicated title, automatically removed; duplicated
title, manually removed; papers fulfilling exclusion criteria; remaining papers, 4%)

From these 64 papers, we recorded information extracted manually on an electronic
spreadsheet with 42 columns, described in detail in (Spolaor et al. 2013).
As most of the information extraction has to be carried out manually, this process
was double checked. A relational database was set up to appropriately record the
42 columns, modelling each sheet as a database table. In this database, each sheet
column is a table attribute and each sheet line is an instance. The corresponding
entity-relationship model consists of four tables: main, dataset, measure and paper,
as well as some relationships between them. The main table records the experimental
settings and results published in the papers which are able to answer the
research question, as well as some foreign keys which link results to a paper and
a dataset. Furthermore, the dataset table records usual statistics from multi-label
datasets, such as: the domain; number of examples, features and labels; as well as
the number of different multi-labels, label cardinality and label density; the URL
where the dataset is publicly available is also kept in this table. The measure table
manages the name and type of the recorded multi-label evaluation measures. The
paper table records the selected publications.
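
To make the four-table organization concrete, the following sketch creates a comparable schema in an in-memory SQLite database. Only the table names come from the text; every column name is hypothetical, as is the foreign key from main to measure, and the actual 42 columns are those described in (Spolaor et al. 2013). The single inserted row uses the Emotions statistics reported in Tables 3 and 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names below are illustrative assumptions, not the original schema.
CREATE TABLE paper   (paper_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT, domain TEXT,
                      n_examples INTEGER, n_features INTEGER, n_labels INTEGER,
                      n_distinct INTEGER, cardinality REAL, density REAL, url TEXT);
CREATE TABLE measure (measure_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE main    (result_id INTEGER PRIMARY KEY,
                      paper_id INTEGER REFERENCES paper(paper_id),
                      dataset_id INTEGER REFERENCES dataset(dataset_id),
                      measure_id INTEGER REFERENCES measure(measure_id),
                      validation TEXT, value REAL, std_dev REAL);
""")

# One dataset row, using the Emotions statistics from Tables 3 and 4.
conn.execute(
    "INSERT INTO dataset (name, domain, n_examples, n_features, n_labels,"
    " n_distinct, cardinality, density, url) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("Emotions", "music", 593, 72, 6, 27, 1.869, 0.311,
     "http://mulan.sourceforge.net/datasets.html"),
)
conn.execute("INSERT INTO measure (name, type) VALUES (?, ?)",
             ("Hamming-Loss", "example-based"))

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
emotions = conn.execute("SELECT name, n_labels, density FROM dataset").fetchone()
```

Keeping the results in main and referring to paper, dataset and measure by key avoids repeating dataset statistics for every recorded measure value.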

Figure 2 shows the distribution per year of the 64 papers considered in this
work, where 21 (33%) of them were published in journals and the remaining in
conference and workshop proceedings. Moreover, apart from 7 (10%) papers published
in the Machine Learning Journal, at most 2 papers were published in the same source.


Fig. 2 Percentage of the 64 papers published per year


Figure 3 shows the number of papers in which each dataset was used. As
already mentioned, we do not consider experimental results from pre-processed
datasets, such as the very frequently used Reuters, unless the final attribute-value
table from which the classifier is generated is reported. Thus, few datasets whose
original domain is text were considered in this study due to this restriction. As
can be observed in this figure, the Yeast dataset is used in almost 80% of the 64
papers considered in this work.

Fig. 3 Number of papers using each multi-label dataset


5 Comparing General_B to Published Evaluation Measure Values

In this section, some statistics of published experimental evaluation measure values
of multi-label classifiers and the ones obtained by General_B are discussed.


5.1 Datasets 


From the 25 datasets used in the 64 papers selected by the systematic review
process, shown in Figure 3, the 10 most frequently used are the ones considered.

Table 3 presents the selected datasets and associated statistics. It shows: the
application domain (Domain); number of instances (#E); number of features (#F);
the total number of labels (|L|); and the percentage of the 64 publications which
use the dataset (Usage). Moreover, Table 4 shows some statistics associated with
the dataset labels: label cardinality (CR(D)), defined by Equation 9; label density
(DS(D)), defined by Equation 10; number of distinct multi-labels (#Dist);
the lowest (Min) and the highest (Max) single-label frequencies, as well as the
first (1Q), second (median, Med) and third (3Q) quartiles, as suggested by
Tsoumakas (2013).


$$DS(D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i|}{|L|} \qquad (10)$$
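
Label cardinality (Equation 9), label density (Equation 10) and the number of distinct multi-labels can be computed from the multi-labels alone. The following is a minimal sketch with a hypothetical function name and toy data; it is not the code used to produce Table 4.

```python
def label_statistics(multi_labels, num_labels):
    """Return (cardinality, density, number of distinct multi-labels)."""
    n = len(multi_labels)
    cardinality = sum(len(y) for y in multi_labels) / n          # Equation 9
    density = sum(len(y) / num_labels for y in multi_labels) / n  # Equation 10
    distinct = len({frozenset(y) for y in multi_labels})
    return cardinality, density, distinct

# Toy multi-labels (hypothetical) over a label set with |L| = 3.
toy_labels = [{"a", "b"}, {"b"}, {"a", "b"}, {"c"}]
cardinality, density, distinct = label_statistics(toy_labels, num_labels=3)
```

Because |L| is constant within a dataset, the density is simply the cardinality divided by |L|, which is consistent with the values in Table 4 (for instance, 1.869 / 6 is approximately 0.311 for Emotions).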

Observe that these 10 datasets from five different domains have different characteristics.
The number of instances varies from 593 up to 43,907; the number of
features from 72 up to 1,836 and the number of single-labels (|L|) from 6 up to
374. Furthermore, the label cardinality varies from 1.074 up to 4.376; the label
density from 0.009 up to 0.311 and the number of distinct multi-labels from 15
up to 6,555. It is worth noting that some datasets present labels with very low or
zero frequency (Min). Although it could be good practice to remove this sort
of label, the original versions of the datasets were kept in this work. More details
about these datasets can be found on the sites where they are publicly available.

Table 3 Datasets and associated statistics

Dataset        Domain     #E      #F     |L|   Usage
Bibtex 1       text       7395    1836   159   17%
Corel5k 1      image      5000    499    374   11%
Emotions 1     music      593     72     6     47%
Enron 1        text       1702    1001   53    33%
Genbase 1      biology    662     1186   27    20%
Mediamill 1    video      43907   120    101   16%
Medical 1      text       978     1449   45    31%
Scene 1        image      2407    294    6     61%
Slashdot 2     text       3782    1079   22    11%
Yeast 1        biology    2417    103    14    80%

web links: 1 http://mulan.sourceforge.net/datasets.html
           2 http://meka.sourceforge.net/

Table 4 Labels’ associated statistics

Dataset      CR(D)   DS(D)   #Dist   Min   1Q    Med   3Q     Max
Bibtex       2.402   0.015   2856    51    61    82    129    1042
Corel5k      3.522   0.009   3175    1     6     15    39     1120
Emotions     1.869   0.311   27      148   166   170   185    264
Enron        3.378   0.064   753     1     13    26    107    913
Genbase      1.252   0.046   32      1     3     17    49     171
Mediamill    4.376   0.043   6555    31    93    312   1263   33869
Medical      1.245   0.028   94      1     2     8     34     266
Scene        1.074   0.179   15      364   404   429   432    533
Slashdot     1.181   0.054   156     0     26    179   250    584
Yeast        4.237   0.303   198     34    324   659   953    1816


5.2 Evaluation measures 

As explained in Section 4, all example-based and label-based measure values reported
in the 64 papers were manually collected. Similar to the dataset selection,
we chose the 8 most frequently used out of the 17 different recorded evaluation
measures. Moreover, the 9 measures not considered here are used in very few (at
most 5) of the 64 papers. Figure 4 shows the number of papers in which the 8
evaluation measures considered were used.

As can be observed, at least in the papers considered in this work, example-based
measures are much more frequently used than the label-based ones. Furthermore,
among the example-based measures, Hamming-Loss is the most frequently
used (55 papers), while Subset-Accuracy is used in fewer papers. As already mentioned,
Hamming-Loss evaluates partially correct classification, while Subset-Accuracy
evaluates exact matching between the ground truth and the predicted multi-label.
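The contrast between the two measures can be made concrete with a small sketch. It assumes the common definitions: Hamming-Loss is the fraction of label positions where the prediction and the ground truth disagree, and Subset-Accuracy is the fraction of examples whose predicted multi-label matches the ground truth exactly. The function names and the toy data are hypothetical.

```python
from typing import List, Set


def hamming_loss(true_sets: List[Set[str]], pred_sets: List[Set[str]],
                 num_labels: int) -> float:
    """Average fraction of labels that differ between truth and prediction."""
    # The symmetric difference counts labels that are wrongly added or missed.
    mismatches = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return mismatches / (len(true_sets) * num_labels)


def subset_accuracy(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Fraction of examples whose predicted label set is exactly right."""
    exact = sum(1 for y, z in zip(true_sets, pred_sets) if y == z)
    return exact / len(true_sets)


# Hypothetical toy example over four labels: a, b, c, d.
truth = [{"a", "b"}, {"c"}, {"a"}]
pred = [{"a"}, {"c"}, {"b"}]

hl = hamming_loss(truth, pred, num_labels=4)   # partially correct answers still earn credit
sacc = subset_accuracy(truth, pred)            # only the second example counts
```

In this toy case the first prediction is partially correct, which lowers the Hamming-Loss but contributes nothing to Subset-Accuracy.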

The results reported on these measures come from different experimental setups
and validation processes, such as cross-validation and hold-out. Considering all the
64 papers, it was observed that 49.6% of the results were obtained using hold-out validation,
43.0% using some type of k-fold cross-validation, such as 10-fold, 3-fold or 5 x 2-fold,
and the rest using another validation process or one that is not
explicitly mentioned in the respective publication. However, it is interesting to note
that from all the measure values which we have manually extracted and recorded,
a total of 6,989, the standard deviation is reported for only 1,435 (20.5%) of them.

Fig. 4 Number of papers using each evaluation measure (example-based and label-based)

Next, statistics related to the published measure values considered in this work,
which were obtained from the classifiers generated by a variety of multi-label
learning algorithms, and the ones obtained by General_B, are presented.


5.3 Results and discussion 

Table 5 shows the General_B baseline values for each evaluation measure and
dataset considered in this work. The eight measures are denoted as: Example-based
Accuracy (Acc); Example-based F-Measure (F1); Example-based Hamming-Loss (HL);
Example-based Precision (Pr); Example-based Recall (Re); Example-based Subset-Accuracy
(SAcc); Label-based Macro-averaged F-Measure (F1_M); and Label-based Micro-averaged
F-Measure (F1_µ).

These values can be directly used in other publications, as they are the same 
for a given dataset and evaluation measure. 
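The reason a baseline value is fixed for a given dataset and measure can be sketched as follows: a classifier that ignores the features predicts the same multi-label for every example, so its scores depend only on the true labels. The sketch below assumes the usual example-based definitions of Accuracy, Precision, Recall and F-Measure. The constant label set used here is a hypothetical choice for illustration and is not General_B's actual selection rule, and the handling of empty label sets is likewise an assumption.

```python
from typing import Dict, List, Set


def example_based_scores(true_sets: List[Set[str]],
                         pred_sets: List[Set[str]]) -> Dict[str, float]:
    """Average example-based Accuracy, Precision, Recall and F-Measure."""
    n = len(true_sets)
    acc = pr = re = f1 = 0.0
    for y, z in zip(true_sets, pred_sets):
        inter = len(y & z)
        union = len(y | z)
        # Assumed conventions for empty sets: two empty sets count as a perfect match.
        acc += inter / union if union else 1.0
        pr += inter / len(z) if z else 0.0
        re += inter / len(y) if y else 0.0
        f1 += 2 * inter / (len(y) + len(z)) if (y or z) else 1.0
    return {"Acc": acc / n, "Pr": pr / n, "Re": re / n, "F1": f1 / n}


# Hypothetical toy ground truth and a constant prediction for every example.
truth = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
constant_prediction = {"a"}  # illustrative only, not General_B's rule
scores = example_based_scores(truth, [constant_prediction] * len(truth))
```

Because the prediction never changes, rerunning this on the same label data always gives the same scores, which is why such values can be tabulated once per dataset.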

Table 6 shows, for each dataset, the number of times that a published measure
value underperforms or is equal to the corresponding General_B baseline value.
Summary information is shown in light gray cells. Column #U_d shows the total
number of measure values fulfilling this condition out of a total of #M_d measure values
recorded for each dataset, and column % shows the percentage. Similar results are
shown in rows #U_m, #M_m and % for each measure considered. This information
is shown graphically in Figures 5 and 6.

As can be observed, from a total of 5,342 measure values on the 10 datasets
considered in this work, 12.8% are worse than or equal to the ones provided by
General_B. Moreover, these worse results are concentrated in some datasets, such
as Corel5k, Mediamill and Enron, as shown in Figure 6. On the other hand, only 4
out of 716 (0.6%) of the measure values published for the Emotions dataset fulfill
this condition. Figure 7 shows information for these four datasets.

Table 5 General_B baseline measure values

Dataset     Acc    F1     HL     Pr     Re     SAcc   F1_M   F1_µ
Bibtex      0.07   0.10   0.03   0.11   0.11   0.00   0.00   0.10
Corel5k     0.12   0.18   0.01   0.20   0.17   0.00   0.00   0.19
Emotions    0.23   0.30   0.33   0.45   0.23   0.07   0.10   0.31
Enron       0.30   0.42   0.07   0.48   0.39   0.00   0.04   0.45
Genbase     0.26   0.26   0.06   0.26   0.26   0.26   0.02   0.23
Mediamill   0.35   0.50   0.04   0.53   0.53   0.00   0.03   0.50
Medical     0.21   0.23   0.04   0.27   0.21   0.16   0.01   0.24
Scene       0.19   0.20   0.27   0.22   0.19   0.17   0.06   0.21
Slashdot    0.15   0.15   0.09   0.15   0.15   0.14   0.10   0.14
Yeast       0.42   0.55   0.26   0.58   0.55   0.05   0.21   0.57

Table 6 Number of measure values which underperform or are equal to the corresponding
General_B baseline value (sorted by %)

Dataset     Acc   F1    HL    Pr    Re    SAcc  F1_M  F1_µ  #U_d  #M_d  %
Corel5k     18    30    13    8     13    12    4     9     107   249   43.0
Mediamill   24    23    19    10    25    2     3     12    118   311   37.9
Enron       32    33    34    18    20    6     4     4     151   606   24.9
Slashdot    11    5     11    5     4     9     5     0     50    266   18.8
Yeast       17    20    52    27    23    2     3     11    155   1094  14.2
Bibtex      5     5     16    4     6     0     0     0     36    326   11.0
Genbase     6     3     4     5     3     0     0     0     21    346   6.1
Medical     6     4     5     6     2     3     0     0     26    540   4.8
Scene       3     1     4     2     3     1     0     0     14    888   1.6
Emotions    2     0     1     1     0     0     0     0     4     716   0.6
#U_m        124   124   159   86    99    35    19    36
#M_m        907   782   1355  580   580   490   245   403
%           13.7  15.9  11.7  14.8  17.1  7.1   7.8   8.9

Fig. 5 Overall performance per measure using General_B as reference (one horizontal bar
per measure, split into outperforming and underperforming values; on average, 12.8% of the
values underperform the baseline)
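The percentages in Table 6 follow directly from its counts, and the arithmetic can be re-derived with a short sketch. The pairs below are the #U_d and #M_d values copied from Table 6; the variable names are illustrative.

```python
# (#U_d, #M_d) per dataset, copied from Table 6:
# values worse than or equal to the baseline, and total values recorded.
table6 = {
    "Corel5k": (107, 249),
    "Mediamill": (118, 311),
    "Enron": (151, 606),
    "Slashdot": (50, 266),
    "Yeast": (155, 1094),
    "Bibtex": (36, 326),
    "Genbase": (21, 346),
    "Medical": (26, 540),
    "Scene": (14, 888),
    "Emotions": (4, 716),
}

# Percentage of underperforming (or equal) values per dataset.
per_dataset = {name: round(100 * u / m, 1) for name, (u, m) in table6.items()}

# Overall figures across the 10 datasets.
total_u = sum(u for u, _ in table6.values())
total_m = sum(m for _, m in table6.values())
overall = round(100 * total_u / total_m, 1)
```

The totals give 682 out of 5,342 values, which corresponds to the 12.8% stated in the text.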

Nevertheless, this kind of information does not show the degree of disagreement
between the evaluation measure values published and the ones provided by
General_B. To this end, we have extracted statistics from these values, as shown
in Figure 8 for the datasets Corel5k, Mediamill, Enron and Emotions considering
the distribution of Accuracy, F-Measure and Hamming-Loss measure values. It also
shows, in brackets, the worst and the best value found in the publications. Recall
that for Hamming-Loss, the smaller the value, the better the multi-label classifier
performance is, while for the others, greater values indicate better performance.

Fig. 6 Percentage of values per measure and dataset which underperform or are equal to the
corresponding General_B baseline value (axes: datasets and percentage of experiments;
one series per measure)
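Since the direction of "worse" depends on the measure, the comparison against the baseline can be sketched as a small helper. The function name `underperforms_or_equals` is hypothetical; the example values are the Corel5k baseline values from Table 5 and the worst and best published values shown in Figure 8. Published values are rounded to two decimals, so equality here is equality at the reported precision.

```python
def underperforms_or_equals(published: float, baseline: float,
                            lower_is_better: bool = False) -> bool:
    """Return True when a published value is worse than or equal to the baseline."""
    if lower_is_better:
        # Loss-type measures such as Hamming-Loss: larger values are worse.
        return published >= baseline
    # Measures such as Accuracy or F-Measure: smaller values are worse.
    return published <= baseline


# Corel5k, Accuracy: baseline 0.12, worst published 0.01, best published 0.17.
acc_worst = underperforms_or_equals(0.01, 0.12)
acc_best = underperforms_or_equals(0.17, 0.12)

# Corel5k, Hamming-Loss: baseline 0.01, worst published 0.21, best published 0.01.
hl_worst = underperforms_or_equals(0.21, 0.01, lower_is_better=True)
hl_best = underperforms_or_equals(0.01, 0.01, lower_is_better=True)  # a tie counts
```

Note that for Corel5k even the best published Hamming-Loss only ties with the baseline at two decimals.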

In fact, this sort of statistics extraction and organization was carried out for
all datasets and measures considered, and can be found at
http://www.iabic.icmc.usp.br/pub/mcmonard/ExperimentalResults/Metz-GeneralB-SupplementaryMaterial

Figure 8 shows that, in some cases, there is a considerable gap between the
worst and the best published measure values. Although this gap could be justified
because different multi-label algorithms minimize different loss functions, which
in turn favor specific evaluation measures, it should be expected that special
explanations are provided in case these measures are worse than the ones from the
simple baseline classifier General_B.

Furthermore, considering in Figure 8 the measures which are better than or
equal to the ones from General_B, it can be observed that there is little improvement
in those measures for the Corel5k, Mediamill and Enron datasets. On the other hand,
the improvement is considerable for Emotions.

Table 7 shows, for the 10 datasets, the highest (↑) and the lowest (↓) measure
values published in the 64 papers for the 8 evaluation measures considered in this
work, as well as the ones from General_B (G_B). Light gray cells indicate that the
difference between the highest and the lowest measure values is greater than or
equal to 0.5. In most cases, it can be observed that there is a very high discrepancy
between the highest and the lowest published measure values.
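The highlighting criterion can be checked with a brief sketch that flags datasets whose highest and lowest published values differ by at least 0.5. The pairs below are the Accuracy columns of Table 7; the variable names are illustrative, and the difference is rounded to two decimals to avoid floating-point noise at the threshold.

```python
# (highest, lowest) published Accuracy per dataset, copied from Table 7.
accuracy_range = {
    "Corel5k": (0.17, 0.01),
    "Mediamill": (0.45, 0.04),
    "Enron": (0.47, 0.05),
    "Slashdot": (0.53, 0.01),
    "Yeast": (0.57, 0.33),
    "Bibtex": (0.38, 0.01),
    "Genbase": (0.99, 0.00),
    "Medical": (0.80, 0.01),
    "Scene": (0.77, 0.00),
    "Emotions": (0.60, 0.22),
}

THRESHOLD = 0.5  # gap at or above which a cell is highlighted

# Datasets whose published Accuracy values span at least the threshold.
flagged = [name for name, (high, low) in accuracy_range.items()
           if round(high - low, 2) >= THRESHOLD]
```

For Accuracy, this flags Slashdot, Genbase, Medical and Scene.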

Regarding the multi-label algorithms used in the 64 papers, most of them follow
the problem transformation approach, using state-of-the-art single-label learning
algorithms as a base learner. Binary Relevance is the most frequently used approach.

At this point, it is worth observing that we are quite confident about the
correctness (with respect to the published results) of the collected measure values
from the 64 papers. As stated earlier in Section 4, these values were initially double
checked. After making the graphs for all datasets and measures considered in this
work, we checked, once more, the worst and the best published values.
Fig. 7 Overall performance per measure of datasets Corel5k, Mediamill, Enron and Emotions.
Panels, with the average share of values underperforming the baseline: (a) Corel5k, 43.0%;
(b) Mediamill, 37.9%; (c) Enron, 24.9%; (d) Emotions, 0.6%


From this third inspection of the gathered data, it was observed that few papers
explain and justify very poor results. However, similar to single-label learning,
if the multi-label community decides to adopt a simple baseline classifier such
as General_B, or any other, we think that it will encourage authors to provide
special explanations for very poor results.


6 Conclusions and Future Work 

The single-label community expects that in non-skewed domains a simple baseline
classifier, which always predicts the majority class, should do worse than classifiers
constructed by a learning algorithm. However, to the best of our knowledge, the
multi-label community still does not have a consolidated idea of a simple multi-label
baseline classifier.

Aiming to raise awareness of considering a simple multi-label baseline classifier,
we have carried out a systematic review of the multi-label learning literature in
order to collect experimental results to contrast with the proposed simple multi-label
baseline classifier General_B.

Fig. 8 Distribution of Accuracy, F-Measure and Hamming-Loss evaluation measure values
for datasets Corel5k, Mediamill, Enron and Emotions. Each panel plots the published values
(Experiments) against the General_B value; the published worst and best values are:
(a) Corel5k - Accuracy [worst: 0.01, best: 0.17]
(b) Corel5k - F-Measure [worst: 0.00, best: 0.29]
(c) Corel5k - Hamming-Loss [worst: 0.21, best: 0.01]
(d) Mediamill - Accuracy [worst: 0.04, best: 0.45]
(e) Mediamill - F-Measure [worst: 0.04, best: 0.60]
(f) Mediamill - Hamming-Loss [worst: 0.37, best: 0.03]
(g) Enron - Accuracy [worst: 0.05, best: 0.47]
(h) Enron - F-Measure [worst: 0.14, best: 0.61]
(i) Enron - Hamming-Loss [worst: 0.50, best: 0.02]
(j) Emotions - Accuracy [worst: 0.22, best: 0.60]
(k) Emotions - F-Measure [worst: 0.42, best: 0.70]
(l) Emotions - Hamming-Loss [worst: 0.40, best: 0.18]

Table 7 Highest (↑) and lowest (↓) values reported for each dataset and the corresponding
General_B (G_B) values

            Acc                F1                 HL                 Pr
Dataset     ↑     ↓     G_B    ↑     ↓     G_B    ↑     ↓     G_B    ↑     ↓     G_B
Corel5k     0.17  0.01  0.12   0.29  0.00  0.18   0.21  0.01  0.01   0.62  0.00  0.20
Mediamill   0.45  0.04  0.35   0.60  0.04  0.50   0.37  0.03  0.04   0.80  0.06  0.53
Enron       0.47  0.05  0.30   0.61  0.14  0.42   0.50  0.02  0.07   0.73  0.13  0.48
Slashdot    0.53  0.01  0.15   0.56  0.05  0.15   0.95  0.04  0.09   0.71  0.03  0.15
Yeast       0.57  0.33  0.42   0.69  0.33  0.55   0.30  0.08  0.26   0.75  0.34  0.58
Bibtex      0.38  0.01  0.07   0.46  0.03  0.10   0.21  0.01  0.03   0.64  0.02  0.11
Genbase     0.99  0.00  0.26   1.00  0.02  0.26   0.99  0.00  0.06   1.00  0.00  0.26
Medical     0.80  0.01  0.21   0.83  0.03  0.23   0.97  0.01  0.04   0.84  0.01  0.27
Scene       0.77  0.00  0.19   0.79  0.17  0.20   0.41  0.08  0.27   0.91  0.00  0.22
Emotions    0.60  0.22  0.23   0.70  0.42  0.30   0.40  0.18  0.33   0.74  0.43  0.45

            Re                 SAcc               F1_M               F1_µ
Dataset     ↑     ↓     G_B    ↑     ↓     G_B    ↑     ↓     G_B    ↑     ↓     G_B
Corel5k     0.51  0.00  0.17   0.02  0.00  0.00   0.04  0.00  0.00   0.29  0.00  0.19
Mediamill   0.70  0.05  0.53   0.12  0.00  0.00   0.19  0.00  0.03   0.63  0.01  0.50
Enron       0.81  0.07  0.39   0.22  0.00  0.00   0.17  0.01  0.04   0.60  0.35  0.45
Slashdot    0.71  0.01  0.15   0.44  0.00  0.14   0.24  0.01  0.10   0.50  0.42  0.14
Yeast       0.82  0.32  0.55   0.28  0.04  0.05   0.87  0.03  0.21   0.85  0.04  0.57
Bibtex      0.65  0.05  0.11   0.27  0.06  0.00   0.32  0.05  0.00   0.46  0.12  0.00
Genbase     1.00  0.00  0.26   0.98  0.96  0.26   0.00  0.00  0.00   0.99  0.98  0.23
Medical     0.94  0.03  0.21   0.78  0.00  0.16   0.37  0.02  0.01   0.81  0.34  0.24
Scene       0.95  0.00  0.19   0.74  0.17  0.17   0.78  0.51  0.06   0.77  0.52  0.21
Emotions    0.79  0.28  0.23   0.35  0.08  0.07   0.73  0.37  0.10   0.73  0.44  0.31

It was found that an important number of published results (12.8%) are worse
than or equal to the ones obtained by General_B. In fact, for all the 10 most
frequently used datasets presented in this work, results worse than or equal to the
ones obtained by General_B were found. In the extreme case, 43% of the published
results for one dataset are worse than or equal to the General_B results.

Although we do not claim that the proposed General_B multi-label baseline
classifier should be the one to be used by the community, we hope that this work
will encourage the multi-label community to consider the idea of using a simple
baseline classifier as an initial reference related to the learning power of multi-label
algorithms. With the use of a baseline, built by only taking into account the label
distribution information, it would be possible to identify cases where the obtained
results are not reasonable enough, and give support for better explanations about
these results.

As future work, we plan to increase the number of electronic databases
searched for publications which answer our research question and do not fulfil any
of the exclusion criteria. The organization of the extracted information makes it
possible to answer several useful questions, such as "Which publications use algorithm A on
dataset B using 10-fold cross-validation and what are the results obtained?" or "Are there
publications reporting results on datasets with cardinality greater than C and a distinct
number of multi-labels greater than W?". We therefore plan to extend and further structure
the gathered information, making it available to the community on a Web page.
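The first kind of question above can be sketched as a simple filter over structured records. This is only an illustration of the idea, not the structure actually used for the gathered information: the `ResultRecord` fields, the `find_results` helper and every record below are hypothetical placeholders, and the numeric values are made up purely to make the example runnable.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResultRecord:
    """One published measure value (hypothetical schema)."""
    paper: str
    algorithm: str
    dataset: str
    validation: str
    measure: str
    value: float


def find_results(records: List[ResultRecord], algorithm: str,
                 dataset: str, validation: str) -> List[ResultRecord]:
    """Select records for a given algorithm, dataset and validation process."""
    return [r for r in records
            if r.algorithm == algorithm
            and r.dataset == dataset
            and r.validation == validation]


# Placeholder records; names and values are invented for illustration only.
records = [
    ResultRecord("paper-1", "A", "B", "10-fold", "HL", 0.10),
    ResultRecord("paper-2", "A", "B", "hold-out", "HL", 0.12),
    ResultRecord("paper-3", "Z", "B", "10-fold", "HL", 0.09),
]

# "Which publications use algorithm A on dataset B using 10-fold cross-validation?"
hits = find_results(records, algorithm="A", dataset="B", validation="10-fold")
```

The second question could be answered in the same way by joining such records with per-dataset statistics such as those in Table 4.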